FEATURE ARTICLE by George Novacek 



Fault-Tolerant Electronic Systems 

There's no escaping it: sooner or later, all electronic systems fail. As a designer of embed- 
ded systems, how can you prepare for system failures? Is there a way to plan for the unfore- 
seen glitches? George's article is a must-read for anyone with an electronic system in the 
pipeline. 



Electronic systems, which are now more sophisticated than ever, provide
us with functions and system intelligence that were undreamed of only a
few years ago. As users of these marvelous devices, we have become so
reliant on their flawless operation that we can no longer imagine how we
could have ever functioned without them. Unfortunately, all electronic
systems fail at one time or another. The question is not if, but rather
when and how they will fail. When the malfunction occurs, what will be
its outcome? Will it be merely the loss of the function, which may be
frustrating but usually acceptable, or will the malfunction have
unpredictable, potentially catastrophic results? Will it be only a
nuisance, like not being able to watch the latest DVD movie, or will you
be faced with a critical, perhaps catastrophic, event, potentially
causing a loss of property or even a loss of life?

A good example of the future focus 
on fault-tolerant embedded controllers 
may be the "X-by-wire" automobile 
that's expected to appear on the mar- 
ket by 2005. In the new automobile, 
many currently mechanical functions 
will be performed by electronics (e.g., 
steering and braking). Can you imagine
having to do the infamous three-finger 
salute at 60 mph while the steering 
has decided to have a mind of its own? 

Electronic systems failures can take 
several forms. The most desirable, of 
course, is fail-operational, where the
system doesn't as much as hiccup and
continues working as if nothing has
happened. Unfortunately, for some
applications, the cost of such a system
may be prohibitive.

Fail-passive is the next best thing. 
Following a fail-passive failure, the
system output assumes some predetermined
desirable state, typically a
power disconnect. Often, the system 
has a human in the loop, such as an 
aircraft pilot, and reverts to manual 
control or some benign state. 

The third condition, fail-active, is 
highly undesirable. Although it can't be
prevented with 100% certainty, it is
usually allowed only as a small probability,
typically 10^-9. In the fail-active
condition, the actuators remain active but
uncontrolled, with unpredictable results.



Figure 1: (a) The basic system is comprised of input devices, output
devices, and a controller. Not all such systems require feedback, but
it gives you some security. (b) Adding a BIT gives you a system capable
of fail-passive operation.



The causes of failures can be broadly classified into two categories.
The first contains failures attributable to weak design. The design may
not be robust enough for the function or the operating environment.
Software bugs or other design errors fall under the same category. The
reasons for a weak design may be many, but the problem can be minimized
through good engineering practice, which starts with a properly defined
and understood specification, and continues through design reviews and
analyses such as fault tree analysis (FTA), failure modes and effects
analysis (FMEA), reliability prediction, etc., which are performed
concurrently with the design. Finally, thorough, exhaustive testing
will validate the design.

Failures in the second category are 
unavoidable. Whether you like it or 
not, failures happen: a wire breaks, a 
component wears out, or a bolt of light- 
ning strikes the equipment. You can't 
prevent them, but you can be prepared 
by designing the system architecture in 
such a way that the effects of a failure 
are predictable and not catastrophic. 
That's what this article is about. I'll
show you the different ways of mak- 
ing your equipment fault-tolerant. 

It is generally accepted that redundancy, along with proper design, is
the cure for loss of functionality because of equipment faults. You must
remember, however, that there may be conditions that cannot be foreseen
or that no practical redundant design can handle. Typically, if the
equipment is exposed to a high-intensity radiated field (HIRF) beyond



36 Issue 162 January 2004

CIRCUIT CELLAR

www.circuitcellar.com



normally expected levels, input signals in all the redundant channels
can be obliterated simultaneously, or a mechanical failure can sever
several harnesses. The goal is to design equipment that can handle
unforeseen conditions gracefully. In case of an HIRF, for example, your
fail-operational controller might have to go into fail-passive mode, or
hold the last position until the interference disappears. You should
never consider a situation in which your embedded controller goes out
of control to be acceptable.

LET'S START SIMPLE

In a basic embedded control system, and it makes no difference whether
it's a position control or a communications terminal, you have at least
one input and one output device in addition to a processing unit,
typically a microcontroller with peripherals such as memory, address
decoding, and glue logic (see Figure 1a). The feedback is not always
necessary, although it's a good idea because it provides the means for
establishing that the process does what you want it to do. Don't assume
it needs to be used only in closed-loop control systems. You may use the
same principle just as well in a communications terminal. For instance,
by looping the transmitted data back to the terminal, you can see if
what you have put on the communications lines is in fact what you had
intended.

You can improve this rudimentary design by including software built-in
test (BIT) routines, having the microcontroller validate and type the
input data, test the condition of the input and output devices, and
validate the outputs. You can see this fundamental approach in every PC.
By now it has become the minimum acceptable standard for any embedded
controller. Many of today's input and output devices are BIT-able, which
means they provide a way for the processor to verify their health. Some
have a separate line, while others define a healthy operating current
range, and so on. The same can be said of some microelectronics with
internal BIT and one pin dedicated to signaling its status.

The internal watchdog timer 
(WDT), which is pretty much standard 
in every microcontroller today, can, to 
some degree, keep an eye on the pro- 
gram execution. The usefulness of the 
internal WDT is questionable because 
it can be disabled by the software it is 
supposed to monitor, and because it 
resides on the same substrate with the 
microcontroller that it is supposed to 
watch. For a little extra money, I prefer
using an external WDT. Many are
available, and they also include cir- 
cuits for monitoring a power supply. 

If the program execution goes off the 
rail for some reason, the microcontroller 
will fail (hopefully) to reset the WDT,
and, in turn, the WDT will attempt to
reset the microcontroller and put it
back on track. If the microcontroller
detects through its BIT routines a problem
with a peripheral device, you can
have it attempt to shut down the system
or go into some predetermined
state. This is all very well as long as the 
microcontroller operates properly and 
the critical peripheral devices can be 
controlled. If not, the system can poten- 
tially go fail-active, which is the one sit- 
uation you must try to avoid. 
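
To make the watchdog behavior concrete, here is a minimal simulation sketch in Python. It is not taken from any particular WDT part: the class name, the 50-ms timeout, and the one-millisecond polling are all assumptions chosen for illustration. It only shows the principle that a program that stops servicing the watchdog gets a reset request.

```python
class ExternalWatchdog:
    """Toy model of an external WDT (hypothetical, for illustration only)."""

    def __init__(self, timeout_ms):
        self.timeout_ms = timeout_ms
        self.last_kick_ms = 0

    def kick(self, now_ms):
        # The microcontroller services the watchdog.
        self.last_kick_ms = now_ms

    def reset_requested(self, now_ms):
        # True once the program has failed to service the WDT in time.
        return (now_ms - self.last_kick_ms) > self.timeout_ms


def first_reset_time(kick_times_ms, timeout_ms=50):
    """Return the time (ms) at which the WDT fires, or None if it never does.

    kick_times_ms lists the moments at which the main loop manages to
    service the WDT; the WDT is polled every millisecond in between.
    """
    wdt = ExternalWatchdog(timeout_ms)
    now = 0
    for t in kick_times_ms:
        while now < t:
            now += 1
            if wdt.reset_requested(now):
                return now
        wdt.kick(now)
    return None


healthy = first_reset_time([10, 20, 30, 40, 50, 60])  # loop runs on schedule
stalled = first_reset_time([10, 20, 200])             # loop hangs after 20 ms
```

With these made-up numbers, the healthy loop never triggers a reset, while the stalled one is caught 51 ms after its last kick.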

BUILT-IN TEST 

The solution to the aforementioned predicament is adding BIT equipment
(BITE) that's external to the processor (see Figure 1b). The BITE
independently monitors the performance of the process, including the
microcontroller. The operative word here is "independently." Being
independent makes the BITE capable of detecting faults within the
microcontroller itself. In some situations, you may have to feed it the
input signals as well, and, as is shown in Figure 1b, you will want the
BITE to be able to independently shut down the outputs.

The BITE performs several functions. 
First, during power-up, it performs the 
power-up BIT (P-BIT), which verifies the 
memory integrity and exercises a number
of functions that could not be exercised
during normal operation (e.g., the WDT).
Make sure you don't allow the P-BIT to 
overwrite stored data or cause the inad- 
vertent movement of mechanical parts 
that could potentially injure someone. 

The P-BIT is generally allowed to 
take a few seconds, but you may want 
to have two versions of it: a full P-BIT 
for cold boot and a short version for 
warm boot, when the several seconds 
of delay might not be acceptable after 
a manual command to re-enable. 

The continuous BIT (C-BIT), which 
runs in the background during normal 
operation, is the monitor that validates 
data, checks for its plausibility, scans 
peripheral devices for integrity, and so 
on. If the C-BIT detects a problem, it 
can either initiate a fault-handler rou- 
tine or activate the actuator disable (or 
channel control) switch. Some systems 
also provide initiated BITs (I-BIT), 
which are test routines that can be initiated
manually during system maintenance
to test the accuracy, control
range, rigging, and so on. 
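
As a rough illustration of the kind of data validation a C-BIT performs, the following Python sketch checks one input sample for range and plausibility. The function name, the limits, and the fault labels are hypothetical; a real C-BIT would be tailored to the sensor and written for the target processor.

```python
def cbit_check(sample, previous, lo, hi, max_step):
    """Return a list of fault labels for one sample (empty list = healthy).

    lo/hi define the valid range; max_step is the largest change between
    consecutive samples considered physically plausible. All values here
    are illustrative assumptions.
    """
    faults = []
    if not (lo <= sample <= hi):
        faults.append("out_of_range")
    if previous is not None and abs(sample - previous) > max_step:
        faults.append("implausible_rate")
    return faults


# A sensor expected to read 0.5 V to 4.5 V and to move slowly:
ok = cbit_check(2.5, 2.4, lo=0.5, hi=4.5, max_step=0.5)
bad = cbit_check(4.9, 2.4, lo=0.5, hi=4.5, max_step=0.5)
```

A non-empty result is what would send the program to its fault handler or trip the actuator disable switch.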

FAULT PROPAGATION 

Maintaining the physical independ- 
ence of the processor and the BITE cir- 
cuits is paramount. A common error is 
to use devices from a multiple-device 
package, such as a quad op-amp, in both 
the process and monitor circuits. This 
may save some money and PCB real 
estate, but it is an absolute no-no. You 
must also ensure that a fault cannot
propagate between functional blocks. 

Consider the circuit diagram in 
Figure 2. A fault within the sensor 
could cause its entire 28-VDC power to 



Figure 2: Always make sure faults, whether external or internal, cannot propagate through the system and






be placed on the multiplexer (MUX) input and destroy it. Often, this can
be prevented by placing a resistor, R, in series with the source to
limit the maximum current, which, together with diodes D1 and D2, keeps
the input voltage to the MUX within safe limits. Many devices have the
protection diodes already incorporated on the substrate, although I
prefer to use discrete devices.
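
A back-of-the-envelope sizing of the series resistor might look like the sketch below. Only the 28-V fault voltage comes from the text; the 5-V clamp rail, the 0.7-V diode drop, and the 10-mA clamp current limit are assumed values for illustration and must be replaced with figures from the actual datasheets.

```python
def min_series_resistance(v_fault, v_rail, v_diode, i_clamp_max):
    """Smallest R (ohms) that keeps the clamp diode current within its limit."""
    return (v_fault - (v_rail + v_diode)) / i_clamp_max


def fault_dissipation(v_fault, v_rail, v_diode, r):
    """Power (watts) the resistor must survive while the fault persists."""
    v_across = v_fault - (v_rail + v_diode)
    return v_across ** 2 / r


# Assumed values: 28-V fault, diode clamping to a 5-V rail, 0.7-V drop, 10 mA max.
r_min = min_series_resistance(28.0, 5.0, 0.7, 0.010)
p_fault = fault_dissipation(28.0, 5.0, 0.7, r_min)
```

Under these assumptions R must be at least about 2.2 kΩ and has to dissipate roughly a quarter of a watt for as long as the fault lasts, which is worth checking against the resistor's rating.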

When using a single input source for 
two channels it may be tempting to use 
just one resistor and connect the two 
MUX inputs in parallel. Don't try to 
save the few cents. Use two resistors, R 
and R', as shown in Figure 2. Make sure 
that a fault within MUX A does not 
propagate to MUX B and vice versa. 

Sometimes a resistor won't do the 
trick. Suppose a sample-and-hold cir- 
cuit requiring fairly low source imped- 
ance follows the MUX and you have 
to use a buffer. It would be tempting 
to feed the buffer, usually an op-amp, 
from the existing +12-V analog power. 
But what would happen if the buffer 
fails? It could feed the 12 V into the 
sample-and-hold circuit, which cannot
sustain it, and thus cause the chain 
destruction of the entire channel. 

Another problem could arise 
because of power sequencing. Not all 
the operating voltages come up at the 
same time. The situation is even 
worse in redundant systems that use 
several independent power supplies. 
This sequencing delay could prove cat- 
astrophic if you don't anticipate it. 

It's no less important to ensure that the 
actuators always can be disabled. Totem 
pole, high-side switches are usually a 
good solution. It's absolutely crucial to 
analyze all of the possible faults and see 
how they will affect the rest of the sys- 
tem (FMEA) before the design is frozen. 

How do you verify that the monitor 
operates properly? You can do so by 
having the processor test the BITE. You 
can inject signals out of range or other- 
wise disallowed to. verify that the BIT 
picks them up. Some of this can be 
done only during the power-up 
sequence, after which it becomes a 
numbers game. In other words, having 
established that a certain feature 
works at power-up, such as WDT time- 
out, the probability of its failing dur- 
ing the projected operating time must 



be negligible. The probability of its
failing and then being followed by another
fault that would, consequently, go undetected
is extremely small. This brings me
to the topics of BIT coverage, testabili- 
ty, and BIT effectiveness. 

BIT EFFECTIVENESS 

BIT effectiveness, or fault coverage, is expressed as a percentage of
all the possible faults within the device that the BIT will detect.
Mathematically, the formula is as follows:

$$FD = \frac{\lambda_D}{\lambda_T} \times 100$$

where

$$\lambda_D = \sum_{i=1}^{K} \lambda_i, \qquad \lambda_T = \sum_{i=1}^{L} \lambda_i$$

FD is the fault detection coverage as a percentage. $\lambda$ is the
failure rate for any one fault in the system. $\lambda_i$ is the failure
rate for the "i-th" identified fault. $\lambda_D$ is the sum total
failure rate for all of the detected faults. $\lambda_T$ is the sum
total failure rate for all of the faults identified in the system.
Finally, note that K is the number of detected failures, and L is the
number of faults identified in the system.
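
The calculation is simple enough to sketch in a few lines of Python. The fault names and failure rates below are invented purely to show the arithmetic; real numbers would come from your FMEA and reliability prediction.

```python
def fault_coverage(failure_rates, detected):
    """Fault detection coverage FD in percent.

    failure_rates maps each identified fault to its failure rate;
    detected is the set of faults the BIT is able to detect.
    """
    total = sum(failure_rates.values())
    covered = sum(rate for name, rate in failure_rates.items() if name in detected)
    return 100.0 * covered / total


# Hypothetical failure rates, in failures per million hours:
rates = {
    "sensor_open": 2.0,
    "adc_stuck": 1.0,
    "emi_filter_degraded": 0.5,
    "driver_short": 0.5,
}
fd = fault_coverage(rates, detected={"sensor_open", "adc_stuck", "driver_short"})
```

Note that the coverage is weighted by failure rate, not by the count of faults: here three of four faults are detected, yet FD works out to 87.5% because the undetected filter fault is relatively rare.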

The fundamentals of testability and
fault coverage can be found in
MIL-STD-2165A, which was replaced in
1995 with MIL-HDBK-2165A. [1] The
original document is hard to find, but
it appears that little was changed
other than the name. The document 
states that 80% to 95% fault coverage 
is considered an acceptable result. It 
also states the obvious fact that 100% 
coverage is impossible to obtain. 

The document is not easy to under- 
stand, so it does not surprise me that 
the fault coverage number has often 
been torn out of context, misunder- 
stood, and has found its way into prod- 
uct specifications where it doesn't make 
much sense. The problem, for example, 
is that the best practically achievable 
fault coverage (95%) by itself is inade- 
quate for any system where predictable 
performance under adverse conditions is 
important. It would mean that one out 
of 20 failures may go by undetected, and 
its effects would be unpredictable. 
That's not acceptable under any circum- 
stances. And when you consider highly 
integrated systems (e.g., a single custom 



IC with a multitude of passive compo- 
nents in the I/O lines, such as EMI fil- 
ters and transient protection), the fault 
coverage number would be much worse 
because those passives are, for the most 
part, untestable by the BIT. 

Does it mean that you may have,
say, 50% undetectable failures? That's
what blindly following the book and
doing the math tells you. Should you
care? With the exception of some
extremely specific military requirements
that are not directly related to
safety, you really shouldn't.

To guarantee safe performance, you
don't need to detect that R135 is open
circuit or that low-pass EMI filter F71
is no longer providing the required
attenuation at 100 MHz because the
ferrite bead inside it has disintegrated.
Such a depth of fault isolation could
make some sense for improving maintainability
and field repair in the days
of discrete components.

Today, field repair of SMT assemblies
is not a good idea to start with. It's
more time and cost efficient to replace
the entire circuit board. You need the
BIT to detect when the system is not
behaving properly, isolate the fault to
the replaceable unit level, and initiate a
predetermined corrective action, such
as an actuator shutdown.

You don't need to test each compo- 
nent, but you should look at the sys- 
tem as a whole. FMEA helps identify 
all faults and makes sure they are all 
detected, albeit the majority of them 
indirectly. The component faults can be detected by the behavior of
their functional block, and the probability of their occurrence is
determined by the reliability prediction. Fault tree analysis is used to
show that the probability of an uncontrolled failure effect (i.e., the
fail-active state) is less than allowed for a given system. Your
specification will customarily state that there shall be no dormant (or
undetected) failures, and that the probability of a disallowed condition
after a single failure does not exceed (typically) 1 × 10^-9.
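
The arithmetic behind a simple fault tree can be sketched as follows. This is a minimal illustration assuming independent events; the two probabilities are made up and do not come from any real analysis.

```python
from math import prod


def p_and(*probs):
    """AND gate: all independent events must occur."""
    return prod(probs)


def p_or(*probs):
    """OR gate: at least one of the independent events occurs."""
    return 1.0 - prod(1.0 - p for p in probs)


# Hypothetical top event: the output goes fail-active only if the
# processor fails AND the independent monitor has already failed.
p_processor_fault = 1e-5
p_monitor_failed = 1e-4
p_fail_active = p_and(p_processor_fault, p_monitor_failed)
```

With these assumed inputs the top event comes out at about 1 × 10^-9, which shows why an independent monitor, rather than a better processor alone, is what gets you to such numbers.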

FAIL-PASSIVE SYSTEM 

Take another look at Figure 1b.

Make sure the output correctly reflects 
the input conditions. You should create 






multiple execution paths within the
software and follow other good design
practices. (For more information on
writing software, refer to my series
entitled "The Joys of Writing Software,"
Circuit Cellar 121-123.)

In many instances, it is possible to 
design and verify the software in such
a way that, if the microcontroller is
working correctly, the eventuality of
an incorrect output is extremely 
improbable. This means that in some 
systems the job of the external BITE 
design could be reduced to merely 
establishing that the microcontroller 
is working properly. The microcon- 
troller performs the remainder of the 
BIT function, including the external 
BITE verification. 

This is the idea behind the external 
WDT. Unfortunately, a commercially 
available WDT can never guarantee 
that the microcontroller is running 
properly. It doesn't take much imagi- 
nation to come up with a number of 
conditions that can forever toggle the 
single WDT line, while the rest of the



system is off track. All it takes is a 

single bit in a multimegabyte program 
corrupted by an external perturbation. 

This problem can be solved in a 
number of ways. My favorite solution 
is for the microcontroller program to 
generate, within a predetermined time 
slot, a sequence of 4-bit tokens, which 
are fed into a modified WDT. In a well
thought-out concept, the probability
of a microcontroller failure corrupting 
the output, while allowing the tokens 
to arrive on time and within the prop- 
er sequence, becomes infinitesimal. 
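
The token idea can be modeled in a few lines. The sketch below is a behavioral model in Python, not the author's actual design: the token values, the 8-to-12-ms time slot, and the class interface are assumptions made for illustration.

```python
class TokenWatchdog:
    """Toy model of a WDT that expects 4-bit tokens in order and on time."""

    def __init__(self, expected_tokens, min_gap_ms, max_gap_ms):
        self.expected = list(expected_tokens)
        self.min_gap_ms = min_gap_ms
        self.max_gap_ms = max_gap_ms
        self.index = 0
        self.last_ms = 0
        self.tripped = False  # latched: once tripped, stays tripped

    def feed(self, token, now_ms):
        """Accept one token; trip on a wrong value or wrong timing."""
        gap = now_ms - self.last_ms
        in_slot = self.min_gap_ms <= gap <= self.max_gap_ms
        correct = token == self.expected[self.index]
        if not (in_slot and correct):
            self.tripped = True
        self.index = (self.index + 1) % len(self.expected)
        self.last_ms = now_ms
        return not self.tripped

    def poll(self, now_ms):
        """Called by the WDT's own timebase; trips if a token is overdue."""
        if now_ms - self.last_ms > self.max_gap_ms:
            self.tripped = True
        return not self.tripped


wd = TokenWatchdog([0x3, 0xC, 0x5, 0xA], min_gap_ms=8, max_gap_ms=12)
first = wd.feed(0x3, 10)    # right token, right time
second = wd.feed(0xC, 20)   # right token, right time
stuck = wd.feed(0xC, 30)    # runaway loop repeats the last token
```

A runaway loop that merely toggles a line, or repeats one token, fails the sequence check; a stalled program fails the timing check.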

Although many input devices are BIT-able or can be sufficiently verified
merely by their data plausibility (e.g., being within a certain range or
occurring in combination with others), others need to be generated by
dual-redundant devices not only to verify validity, but also to
guarantee that a valid input is available at all times. Devices such as
dual LVDTs for displacement measurements and dual temperature or
pressure sensors are commonly used in situations where accurate
knowledge of the value is critical for the system's performance.
Therefore, the first step in a system design must be a hazard analysis
to determine the criticality of individual functions and their necessary
redundancy requirements.

When processing analog inputs and outputs, such as in a closed-loop
position system, verifying only microcontroller performance may not be
sufficient to guarantee the required safety. Should this happen, the
monitor also needs to acquire raw analog signals, as shown in Figure 1b,
and process them internally. Often, this can be accomplished using the
BITE FPGA plus ADCs, DACs, and table look-ups. Most often, the BIT needs
to verify the output to be within a certain window only. Such a
single-channel, self-monitoring architecture, with the BITE capable of
shutting it down when a fault is detected, can be used in a fail-passive
system. It will also serve as a building block for fail-operational
systems with little change.
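
The window check with a table look-up could be modeled roughly as shown below. In a real design this logic would live in the FPGA monitor; the Python version, the table contents, and the 0.25 window are illustrative assumptions only.

```python
import bisect

# Hypothetical (raw input, expected output) calibration points.
TABLE = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 5.0)]


def expected_output(x):
    """Linear interpolation in TABLE, clamped to the table's ends."""
    xs = [point[0] for point in TABLE]
    x = min(max(x, xs[0]), xs[-1])
    i = max(1, bisect.bisect_left(xs, x))
    (x0, y0), (x1, y1) = TABLE[i - 1], TABLE[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def output_enabled(raw_input, actual_output, window=0.25):
    """True while the actual output stays inside the allowed window."""
    return abs(actual_output - expected_output(raw_input)) <= window


within = output_enabled(1.5, 3.1)   # expected 3.0, small deviation
outside = output_enabled(1.5, 4.0)  # too far off: disable the actuator
```

The monitor does not have to reproduce the control law exactly; it only has to know roughly where the output should be for a given input.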







Figure 3: This is a dual-redundant system. A disagreement between the
processing and monitoring channels results in the disabling of the actuator.



An obvious question arises: Why not 
build two identical, simplex, micro- 
controller-based channels and simply 
compare their outputs, as is depicted 
in Figure 3, instead of bothering with 
designing an FPGA BITE? In this
dual-redundant architecture, one channel
would provide the control, and the
other would be the monitor. The channels
exchange data, and if they both
agree, the system remains operational. 
If they differ, the system goes passive. 
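
In its simplest form, the comparison logic amounts to the sketch below. The tolerance value and the decision to latch the passive state are assumptions of this example; the text does not specify either.

```python
class DualChannelComparator:
    """Toy model: pass the control output only while both channels agree."""

    def __init__(self, tolerance):
        self.tolerance = tolerance
        self.passive = False  # latched once a disagreement is seen (assumption)

    def step(self, control_out, monitor_out):
        """Return the actuator command, or None when the system is passive."""
        if abs(control_out - monitor_out) > self.tolerance:
            self.passive = True
        return None if self.passive else control_out


cmp_ = DualChannelComparator(tolerance=0.5)
agree = cmp_.step(10.0, 10.2)      # channels agree: command passes through
disagree = cmp_.step(10.0, 12.0)   # channels differ: go passive
afterward = cmp_.step(10.0, 10.0)  # stays passive until deliberately re-enabled
```

A practical comparator would usually also require the disagreement to persist for some time before tripping, so that noise does not shut the system down.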

This is, in fact, another common 
way of building fail-passive systems. 
However, to avoid common mode errors
that could be caused by design, an internal
bug in the microcontroller, a software
bug, or an external condition, it is often
required that you at least use dissimilar
software in the two channels. It's preferable
to use dissimilar hardware as well.

The cost of the development and certification of what amounts to two
different controllers, usually requiring level A software (per DO-178B),
is prohibitive when compared to the architecture in Figure 1b. It's
easier and less costly to create and, more importantly, validate a
monitor fully implemented in hardware, even with the requirement to
follow DO-254. [2] Well-designed hardware can be fully testable, but
software rarely is. A hardware monitor may allow you to reduce the
software criticality, and the self-monitoring channel can be a useful
building block for redundant, fail-operational systems at little extra
development cost.

Another alternative, given here
merely for completeness, is to have
the dual-redundant channels fully created
in analog circuitry (see Figure 3).
This is an obsolete concept that provides
no benefit of flexibility in
today's environment.

FAIL-OPERATIONAL SYSTEMS 

By employing a pair of self-monitor- 
ing controllers, you can create a fail- 



operational system that dual-redun- 
dancy by itself cannot. (Two channels 
let you determine that they do not 
agree, but you cannot tell which one is
wrong.) You only need to slightly 
modify the monitors (that's also why 
using an FPGA is a good choice) to 
perform arbitration and switching 
between the channels. A usual alterna- 
tive is to make one channel dominant 
and the other submissive at power-up. 
The dominant channel controls the 
process, and, while relying on its own 



BIT for monitoring, it also exchanges 
status with the submissive channel 
that is running in parallel. A problem 
detected within the dominant channel
causes its shutdown and control is 
taken over by the submissive channel, 
which becomes dominant. 
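
The dominant/submissive hand-over can be pictured as a tiny state machine. The Python model below is only a sketch of the arbitration rule described above; the channel names and the class interface are invented, and a real arbitrator would be implemented in the hardware monitors.

```python
class DualSelfMonitoring:
    """Toy arbitration model for two self-monitoring channels, A and B."""

    def __init__(self):
        self.healthy = {"A": True, "B": True}
        self.dominant = "A"  # assigned at power-up

    def report_fault(self, channel):
        """A channel's own BIT has detected a fault and shut the channel down."""
        self.healthy[channel] = False
        if channel == self.dominant:
            other = "B" if channel == "A" else "A"
            # The submissive channel takes over only if it is still healthy.
            self.dominant = other if self.healthy[other] else None

    def mode(self):
        return "operational" if self.dominant is not None else "passive"


system = DualSelfMonitoring()
system.report_fault("A")           # first failure: B takes over
after_first = (system.dominant, system.mode())
system.report_fault("B")           # second failure: nothing left to switch to
after_second = (system.dominant, system.mode())
```

The first failure leaves the system operational on the other channel; the second one takes it passive.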

The microcontrollers exchange data 
via a serial interface. CAN bus is 
quite popular and reliable. But, this 
data exchange is noncritical; it's mere- 
ly a performance enhancement. The 
dominant-submissive switch-over






interface between the monitors/arbi- 
trators is accomplished by a hard 
connection. For obvious reasons, 
microcontroller communications 
would be suspect at this point. 
Besides, each channel must be fully 
operational with the other channel 
dead. To filter out noise, the switch-over
time is usually in milliseconds,
which most mechanical systems 
don't even notice. 

There are numerous ways to 
arrange the operation of the chan- 
nels. One typical method is to have 
both channels actively control with 
their outputs summed up. This can 
be done in many ways. For example, 
when the actuator is a current-driv- 
en torque motor (electrohydraulic 
servo valve, or EHSV), its two coils 
may be used in parallel, each fed by 
one channel. In case one of the chan- 
nels shuts down, you can either 
enter into a reduced authority mode 
by driving the torque motor at only 
50% or raise the gain of the remain- 
ing operational channel to deliver 
100%. Your choice depends on the
system characteristics and the result
of the hazard analysis.

Figure 4: This triple-redundant system uses three
independent computers with majority vote.
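
The two options for the summed, two-coil arrangement can be expressed in a few lines. This is a simplified sketch: the function name and the idea of expressing the demand as a single current value split equally between the coils are assumptions for illustration.

```python
def coil_currents(demand, a_ok, b_ok, raise_gain):
    """Return (coil A current, coil B current) for a two-coil torque motor.

    Each healthy channel normally supplies half of the demand. With one
    channel down, raise_gain selects between reduced authority (50%) and
    full authority from the surviving channel.
    """
    active = [a_ok, b_ok].count(True)
    if active == 0:
        return (0.0, 0.0)
    share = demand / 2
    if active == 1 and raise_gain:
        share = demand
    return (share if a_ok else 0.0, share if b_ok else 0.0)


normal = coil_currents(8.0, True, True, raise_gain=False)
reduced = coil_currents(8.0, True, False, raise_gain=False)
full = coil_currents(8.0, True, False, raise_gain=True)
```

Which of the two single-channel behaviors is appropriate is exactly what the hazard analysis has to decide.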

Another method is to have the sec- 
ond channel in a standby or powered- 
down mode and to activate it only 
when it's needed to take over control.
This way, the spare channel does not
accumulate any run-time. But, in
some systems, it may take an unacceptably
long time for it to come up
and take over. Consequently, it is
not seen too often in safety-critical
systems.



When considering fail-operational 
architecture, don't forget the mechani- 
cal aspects of the electronic controller 
(i.e., the packaging design). You must 
look at it as if the controller has the 
characteristics of a dew worm: if you 
cut it in half, each half continues to 
live on its own. There must be no 
fault propagation from one channel to 
the other; communications between
them must be nonessential, and
absolutely no components may be
shared. Normally, a metal partition
inside the cabinet ensures that a fire,
for instance, in one channel cannot
propagate to the next. Each channel
has its own harnesses and connectors.
Some dual-channel aircraft systems
reside in two separate cabinets, each
located in a different part of the aircraft
to prevent a mechanical event
from wiping out the entire system.

TRIPLE REDUNDANCY 

To further increase the reliability 
and availability of fail-operational systems,
triple and even higher redundancy
is used, as shown in Figure 4.






Figure 5: The dual-channel system uses self-monitoring to achieve a
fail-operational configuration.



There are many aspects to consider. You can use ordinary, simplex
channels, as in Figure 1a, or self-monitoring channels, as shown in
Figure 1b. For instance, in making the decision, you will have to
consider the allowable tracking tolerances. If the outputs of the three
channels must track perfectly, you will have to use three channels with
identical hardware and software, and you'll probably end up
synchronizing them. But that introduces the higher probability of a
common mode fault within the system. I cannot think of a position
control where such perfect tracking would be required and the
accompanying danger of common mode faults preferable to having
dissimilar hardware and software. I would also feel more comfortable
with the self-monitoring arrangement, especially when the cost of the
FPGA monitor is negligible compared to the entire system and its
criticality.

The idea is that triple- and
higher-redundancy systems control on the
basis of a majority vote. Voting is
accomplished in an external module,
but the processing units often
exchange input and status data for 
"pre-voting" as well. This internal 
data exchange, if not properly 
designed, can lead to an unwanted 
effect, when no two outputs are the 
same. This is called a Byzantine 
Generals' Problem. When designing 
the data exchange and voting scheme, 
this probability must be considered. It 
can be avoided, for example, by using 
the MVS algorithm, which selects the 
output that is between the other two. 
The external voting module can be 
implemented electronically, or it can 
be mechanical or hydraulic. This 
depends on the specifics of the system. 
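
Selecting the output that lies between the other two takes only a sort. The sketch below assumes that MVS stands for mid-value select, which is the usual expansion of the abbreviation; the text itself gives only the letters, and the function name and sample values are made up.

```python
def mid_value_select(a, b, c):
    """Return the channel output that lies between the other two."""
    return sorted((a, b, c))[1]


# Two channels roughly agree; the third has run away.
voted = mid_value_select(10.0, 10.2, 55.0)
```

A single wild channel, whether it fails high or low, can never be the middle value, so it is excluded without the voter having to decide which channel is faulty.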

We have considered what happens 



when the system experiences a single 
fault. Most fail-passive systems accept 
a 10^-9 probability of a critical failure
(i.e., the fail-active state) caused by a 
single fault. This represents one fail- 
ure in more than 100,000 years. The 
probability of a dual failure is expo- 
nentially less. In real life, however, 
the impossible does happen. As 
British Prime Minister Disraeli once 
said, "there are lies, damned lies, and 
statistics." One time I experienced an
"impossible" event that had a 10^-9
probability of happening within the 
first minute of operation of the first 
production unit. Although extremely 
embarrassing, the subsequent design 
review showed no design flaw. The 
event hasn't repeated itself for 
almost 15 years. 

What if you have a critical system 
and the statistically impossible is not 
impossible enough? The approach is
essentially the same as for the single
fault: you just increase redundancy.
There are, for instance, triple-redundant
flight computers in which each of
the three computers is triple redundant 
itself, using three different processors 
from three different manufacturers to 
avoid common mode failures. 

EXPECT THE UNEXPECTED 

In summary, let's review the differ- 
ent architectures, keeping in mind 

that there are many combinations of 
those basic principles. Starting with 
the plain vanilla simplex controller
like the one in Figure 1a, you must
recognize that it is unsuitable for any 
system where safety and reliability 
may be of concern. The first failure 
puts it out of commission and the 
repercussions are anybody's guess. 

A fail-passive system, composed of
a self-monitoring controller, as shown
in Figure 1b, or a dual-redundant
configuration as in Figure 3, will satisfy
many requirements. After the first
failure, it will shut down or assume
some other predetermined state with
a probability of usually less than 10^-9
of going fail-active. After the system
deactivates, a second failure should
not be a concern.

A dual self-monitoring system like
the one in Figure 5 will continue to
operate after the first failure; it goes
fail-passive after the second failure.
Triple-redundant (and higher) systems
will potentially remain fail-operational
until the last channel, which will be
fail-passive.

Finally, when designing an embedded controller, it is extremely
important to consider failures that can affect all of the channels at
once. A good example is a signal corruption, which can be the result of
a lightning strike, HIRF, mechanical failure, etc. Such faults differ
from simple component failures in that they can and usually do affect
the processing channels simultaneously. Therefore, it is important to
understand how the system will behave under such circumstances and make
sure that it does not become fail-active. Ensuring RF immunity and full
functionality at 200 V/m is one thing you can do. But what happens if
the system is irradiated by a burst of 400 V/m? Or what if the shielding
gets disconnected? You must always consider the unexpected and make sure
the results are predictable.

George Novacek has 30 years of experience
in circuit design and embedded
controllers. He is currently the
general manager of Hispano-Suiza
Canada, a division of the Snecma
Group, the world's leader in manufacturing
propulsion and landing gear
systems. You may reach him at
gnovacek@nexicom.net.



REFERENCES 



[1] U.S. Department of Defense,
"Testability Program for Systems
and Equipments," MIL-HDBK-2165,
1995.

[2] RTCA, Inc., "Design Assurance
Guidance For Airborne
Electronic Hardware," DO-254,
2000.



RESOURCES 



G. Novacek, "A Sure Thing: 
Guaranteeing 99.99999% 
Reliability," Circuit Cellar 129.

— , "Designing for Reliability, 

Maintainability, and Safety," 
Circuit Cellar 125 and 126. 






